Classification and Regression Tree (CART) analyses of genomic signatures reveal sets of tetramers that discriminate temperature optima of Archaea and Bacteria

 BETSEY DEXTER DYER,1 MICHAEL J. KAHN,2 and MARK D. LEBLANC2,3
1 Department of Biology, Wheaton College, Norton, MA 02766
2 Department of Math and Computer Science, Wheaton College, Norton, MA 02766
3 Author to whom correspondence should be addressed (mleblanc@wheatoncollege.edu)


Corresponding Author:
Mark D. LeBlanc
Department of Mathematics and Computer Science
Wheaton College
Norton MA 02766
Phone:  508.286.3970
Fax: 508.286.8278
Email:  mleblanc@wheatoncollege.edu


Summary
Classification and Regression Trees (CART) were applied to genome-wide tetranucleotide frequencies (genomic signatures) of 195 archaea and bacteria. Although genomic signatures have typically been used to classify divergences, the focus here was convergent evolution of genomes. Temperature optima for most of the organisms examined could be distinguished by CART analyses of tetranucleotide frequencies. This suggests that pervasive (non-linear) qualities of genomes may reflect certain environmental conditions (such as temperature) in which those genomes evolved. The predominant use of GAGA and AGGA as the discriminating tetramers in CART models suggests that purine-loading and codon biases of thermophiles may explain some of the results.

Key Words:  Archaea, Bacteria, bioinformatics, CART (Classification and Regression Trees), convergence, decision tree, extremophile, genomic signature, hyperthermophile, purine-loading, temperature, tetranucleotide frequencies, thermophile, virtual coding strand
 


Introduction
It was Erwin Chargaff who first noticed the ratios of nucleotides in DNA and reported the similar frequencies of adenine and thymine and of cytosine and guanine in his influential 1949 paper (Chargaff et al. 1949). Chargaff sustained a lifelong interest in the complexities of genomes, in many cases speculating far ahead of the development of necessary methods and instrumentation. For example, in his Essays on Nucleic Acids (1963), Chargaff predicted the significance of determining frequencies of oligonucleotide motifs, to perform linguistic analyses of DNA "texts" (e.g. pg. 130, 131), and he foreshadowed cluster analyses of oligonucleotide frequencies (e.g. pg. 148).

"I knew that a great deal can be learned about an unknown language through a study of its phonemes, their frequency, distribution density, and allophonic relationships."  (pg. 130)

	A paucity of DNA sequences and constraints on computational methods limited the pace of research on motif frequencies through the 1980s. Microbiologists were aware of the cryptic importance of GC ratios as part of the characterization of archaea and bacteria; however, Chargaff's initial momentum was not regained until the 1990s, when Samuel Karlin and others began to apply computational analyses to a growing suite of completed genomes.
	Many authors have noted what Chargaff first put into words, that there are aspects of natural language analysis in genomic signature comparisons (Karlin et al. 1994, Fertil et al. 2005, Paz et al. 2006, Vasilevskaya et al. 2006). In 1992, with only partially sequenced bacterial, archaeal, and eukaryotic genomes, as well as some completed virus genomes, Karlin and colleagues began exploring the relative abundances of di-, tri-, and tetra-nucleotides (Burge et al. 1992). Karlin et al. (1994) focused on dinucleotide relative abundances, referred to as a "robust signature to genomes." In Karlin and Ladunga (1994) and Karlin and Burge (1995) "genomic signature" was more formally defined as a discriminator between genomes. 
	Seven complete microbial genomes and partial genomes of others were available in 1997 for a genomic signature analysis using short oligonucleotides. In Karlin et al. (1997), the authors noted that such analyses are a departure from inter-genome comparisons using alignment techniques and that comparisons of genomic signatures may overcome some of the problems with alignments. The authors found enough correlation between signatures based on di-, tri-, and tetranucleotides that most subsequent studies have focused on dinucleotides as being sufficient discriminators. However, other authors (Pride 2003, Teeling et al. 2004, Fertil et al. 2005) have focused on tetramers as an effective trade-off between accuracy and speed of analysis.
	Since the initial work of Karlin and colleagues, numerous studies have confirmed the efficacy of a non-alignment, genomic signature approach to genome analysis. One recent example is van Passel et al. (2006a), who examined the dinucleotides of 334 bacterial and archaeal genomes. They note a congruency between genomic signatures and 16S rRNA sequences. Confirming previous results (e.g. Karlin and Burge 1995, Pride 2003), they conclude that the genomic signature has a distinctive phylogenetic signal and may better reflect evolutionary relationships than single gene comparisons. Confidence in the technique resulted in the suggestion that, based on genomic signatures, certain enteric gamma proteobacteria may be combined while species such as Buchnera aphidicola could be split.
	Genomic signature analyses are relatively unaffected on a large scale by the heterogeneities of horizontal gene transfers. Yet, the pervasiveness of genomic signatures (Jernigan and Baran 2002) allows intragenomic analysis in search of local heterogeneities, thus identifying horizontal transfers (Karlin 2001, Lio 2002, and Dufraigne et al. 2005). This property of pervasiveness also allows for the analysis and classification of plasmids (van Passel et al. 2006b) and short sequences, such as genome fragments of 30000-40000 bp (Teeling et al. 2004), 5000 bp (Woyke et al. 2006), and as low as 400 bp (Sandberg et al. 2001). This may be one of the reasons that the earliest studies of genomic frequencies based only on partial genomes were so promising.

Genomic Signatures as Potential Indicators of Convergent Evolution of Sequences
Genomic signature analysis seems to be applicable both for classifying divergent evolution and for classifying aspects of convergent evolution. Again, it may be the pervasiveness of the genomic signature that allows this versatility. 	

Several analyses of genomic signatures, such as Foerstner et al. (2005), have addressed convergent evolution of some sequences, selected by the environment, independent of phylogeny. Such selection pressures include physical constraints on the basic functions of the DNA molecule itself, as well as on the structures of the proteins encoded. Archaea and bacteria of thermal springs have been obvious choices for comparing the effects of extreme temperatures on convergences of genome composition. Indeed, consideration of this question dates back to biochemical studies of AT/GC ratios and dinucleotide compositions in the 1960s and 1970s (reviewed in Karlin and Burge 1995).
	A caveat of Doolittle (1994) is that sequence convergence (a sort of "molecular mimicry") can be difficult to distinguish from horizontal transfers and from statistically insignificant short matches. The concerns listed by Doolittle are overcome (or lessened) by examining whole genomes, not just short sequences of particular genes or proteins. Indeed, genomic signature analyses may not only overcome the problem of horizontal transfers, but may point them out as heterogeneities within genomes as Karlin (2001), Lio (2002), and Dufraigne et al. (2005) have indicated.
	Campbell et al. (1999) examined signatures of five thermophilic archaea compared to 22 bacteria (mostly proteobacteria and Gram positives). In contrast to other studies, they concluded that thermophily, at least for these five species, was not distinguished with genomic signatures. However, subsequent studies, including this one, suggest that there are genome-wide signatures for thermophilic and hyperthermophilic microorganisms.
	Trends in amino acid compositions of proteins ought to have some genome-wide influence on codons and, therefore, some effect on a genomic signature. Amino acid use in thermophiles includes the replacement of polar, non-charged amino acids with charged amino acids such as lysine, arginine, aspartic acid, and glutamic acid (Cambillau and Claverie 2000, Suhre and Claverie 2003). A similar analysis of two psychrophilic archaea (Saunders et al. 2003) showed a bias for non-charged polar amino acids. Carbone et al. (2005) (and others reviewed in Carbone et al. 2005) determined that a "codon bias signature" separated thermophiles from mesophiles in a set of 16 archaea and 80 bacteria.
	Karlin et al. (1994) and Karlin and Burge (1995) speculated that environmental influences such as pH, temperature, and salinity might influence dinucleotide genomic signatures. In a study of seven complete and several partial genomes, Karlin and Campbell (1997) noted that the three thermophiles had significantly low proportions of the dinucleotide CG. Kawashima et al. (2000) and Suhre and Claverie (2003) concluded that the "dinucleotide statistical index", computed from dinucleotide frequencies, showed more pure pyrimidine dinucleotides (TC combinations) and pure purine dinucleotides (AG combinations) in hyperthermophiles. An in vitro investigation by Xia et al. (2002) of Pasteurella multocida cultivated for approximately 14,400 generations at 45C (up from 37C) resulted in a decrease of GC% and an increase of TA, TT, and AA dimers. The fact that both coding and non-coding sequences have been determined to show genomic signatures (e.g. Karlin and Burge 1995, Karlin and Mrazek 1996, Campbell et al. 1999) supports hypotheses of sequence convergences. Convergent sequence signatures in non-coding regions may reflect similar environmental pressures on fundamental DNA activities such as replication and repair.

Archaea versus Bacteria or Hyperthermophile versus Mesophile?
It should be noted that completely sequenced sets of hyperthermophiles currently are dominated by archaea. Therefore, it is important to differentiate between divergence based on phylogeny and convergence based on a "lifestyle" such as temperature preference. For example, Carbone et al. (2005) found signature differences in hyperthermophiles and thermophiles, from a study that began with separating bacteria from archaea. Graham et al. (2000) used coding sequences of nine archaea to find signature sets of genes, some pertaining to unique archaeal functions and metabolisms such as methanogenesis.  However, all of the nine archaea in the study were also hyperthermophiles, a factor that could be examined as well. Fadiel et al. (2003) probed for repeats of at least 25 bp in archaea and found "remarkable" signatures in non-coding regions. Again, the set of seven archaea consisted entirely of hyperthermophiles, while a comparison set of bacteria contained six mesophiles. The authors noted that denaturing conditions such as high temperatures may have selected for certain temperature-stable repeats.  

Classification and Regression Tree (CART) analyses as an indexing tool for the pervasive information of Genomic Signatures
In this study, genomic signatures were analyzed using CART (Breiman et al. 1984), a powerful tool for developing a classification scheme that categorizes an organismic characteristic on the basis of any number of classifying (predictor) variables. Some studies have used linear discriminant analysis (Carbone et al. 2005) to classify microbes based on codon biases. Other techniques have been used to classify fragments of DNA using nucleotide compositions. These include machine learning methods such as the self-organizing map (SOM) (e.g. Kohonen 1990, Abe et al. 2003, 2005, 2006) and the support vector machine (SVM) (e.g. Tsirigos and Rigoutsos 2005, McHardy et al. 2007). As shown in other studies (e.g. Lin et al. 2003), CART makes efficient use of large collections of classifying variables, allowing nonlinear relationships to be built among the classifiers and yielding a simple, sequential set of binary rules for classification.
	Tree-based methods, such as CART, have been applied to the mining of large datasets, such as microarrays, to detect discriminating factors for classification (Boulesteix et al. 2003). CART methodology (Hermanek 1994, Masic 1998) has been used to winnow through phenotypes, symptoms, and prognoses to create decision trees that might aid in the diagnoses of medical conditions. As such, CART is a tool for cataloguing or indexing, both of which are necessary as approaches to classifying organisms using the billions of base pairs from DNA sequences awaiting analysis. The ability to make a set of rapid, exploratory queries to seek discriminating variables is a worthy enterprise and a potential source of information. The practice is analogous to cataloguing a library or indexing a book. In a sense, the making of a catalogue or index is like the formulation of a hypothesis concerning the anticipated or potential meaning in a new grouping.
	Genomic signatures are a significantly different paradigm (Karlin and Campbell 1997) from alignment methods, including synteny, and therefore require different comparison methods. It is the pervasive, linguistic quality of genomic signatures (unlike sequence alignment) that lends itself to classification schemes such as CART. This versatility of genomic signatures may seem contradictory:  to classify either convergent groups or divergent groups and to be unaffected by horizontal transfers on a global scale and yet to identify horizontal transfers on a local level.  Such versatility is one of the strengths of genomic signatures and is the reason that signatures may be indexed well using CART. 
	Genomic sequences have been most often catalogued or indexed by evolutionary phylogeny and the datasets are typically of genes, to the exclusion of non-coding regions. Much work remains to be done with other sorts of cataloguing and indexing systems. Genomic signatures as analyzed by CART afford that opportunity. In this study, we found CART to be especially useful in building classification schemes relating tetramer (or other oligomer) frequencies to characteristics of interest. The method is successful in revealing short lists of discriminating tetramers, the frequencies of which distinguished three temperature ranges, hyperthermophily, thermophily, and mesophily. Short decision trees were generated and shown to be effective predictors of known organisms external to the training data used to generate the trees. 

Materials and Methods
Sequence collection
One-hundred and ninety five (195) fully sequenced microbial genomes (24 archaeal and 171 bacterial species), protein tables, and associated annotations were downloaded from GenBank (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, April 1st, 2006). Note that most fully sequenced hyperthermophiles at NCBI were archaea. A larger list was culled such that when species included more than one strain, just one strain was chosen and organisms not identified to genus and species were excluded. Of the 195 genomes, 16 are classified at NCBI as hyperthermophiles, 14 are thermophiles, and 165 are mesophiles (Table A2 in Supplemental Materials). 

Producing a virtual coding strand for each genome
A "virtual coding strand" for each genome was produced from the positive strand. A virtual coding strand comprises the entire positive strand, including all coding and non-coding regions. The virtual coding strand allows a more precisely controlled experimental system by eliminating the extra variable of gene orientation. To produce the virtual coding strand, each gene actually positioned on the complementary strand (as indicated by the associated protein table) was placed in "correct" (sense) 5' to 3' orientation by replacing the "reversed" (antisense) gene with its reverse complement. Coding regions with overlapping start-stop locations were handled independently and each region was concatenated onto the virtual coding strand. Any "corrected" genes (reversed or overlapping) were flanked by Ns at both ends to eliminate tetramer counts across these virtual intergenic-genic boundaries. In cases where organisms have more than one chromosome, the chromosomes were concatenated, separated by an 'N'. Thus a complete "sense" strand including all coding and non-coding regions with all genes in 5' to 3' order was constructed and used for subsequent steps in the analysis.
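The construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names are hypothetical, gene coordinates are assumed to be 0-based and half-open, and overlapping coding regions (which the text says were handled independently) are deliberately left out.

```python
# Sketch of the virtual coding strand (simplified; names and coordinates are assumptions).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]

def virtual_coding_strand(positive_strand, genes):
    """Build a sense-oriented strand from the positive strand.

    genes: iterable of (start, end, strand) with strand "+" or "-",
    assumed non-overlapping. Genes on "-" are replaced by their reverse
    complement and flanked by N so no k-mer spans the new boundary.
    """
    pieces = []
    cursor = 0
    for start, end, strand in sorted(genes):
        if strand == "-":
            pieces.append(positive_strand[cursor:start])   # unchanged sequence so far
            pieces.append("N" + reverse_complement(positive_strand[start:end]) + "N")
            cursor = end
        # "+" genes are already in sense orientation and are copied as-is
    pieces.append(positive_strand[cursor:])
    return "".join(pieces)

def join_chromosomes(strands):
    """Concatenate several chromosomes, separated by a single N."""
    return "N".join(strands)

print(virtual_coding_strand("AAAAGGGATTTT", [(0, 4, "+"), (4, 8, "-")]))
# AAAANTCCCNTTTT
```

The flanking Ns matter for the counting step: any window that contains an N is skipped, so no tetramer is counted across an artificial junction.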

Collecting Overlapping Tetramers (and other oligomers)
Counts of tetramers across each virtual coding strand were recorded by moving a four-base window through the strand one base at a time. Only tetramers containing A, C, G, and Ts were counted. Dimers, trimers, and pentamers were also collected and analyzed to determine the optimal size for the subsequent CART analyses.
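A sliding-window count of this kind might look like the following sketch. The function names are illustrative, and converting counts to relative frequencies by dividing by the total number of valid windows is an assumption about how "intra-genomic relative frequencies" were computed.

```python
from collections import Counter

VALID = set("ACGT")

def count_kmers(strand, k=4):
    """Count overlapping k-mers, moving the window one base at a time.

    Windows containing anything other than A, C, G, or T (for example the
    N separators) are skipped.
    """
    counts = Counter()
    for i in range(len(strand) - k + 1):
        kmer = strand[i:i + k]
        if set(kmer) <= VALID:
            counts[kmer] += 1
    return counts

def relative_frequencies(counts):
    """Convert raw counts to frequencies that sum to 1 (assumed normalization)."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

counts = count_kmers("GAGAGANAGGA")
print(counts["GAGA"], counts["AGAG"], counts["AGGA"])  # 2 1 1
```

Setting `k` to 2, 3, or 5 gives the dimer, trimer, and pentamer counts mentioned above.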
 
Randomized control sequences 
Two variations of random sequences were constructed and used as controls for testing the external validity of each decision tree. First, for each of the 195 organisms, independent draws from a multinomial distribution, with probabilities set to the empirical proportions of A, C, G, T, and N within that organism's sequence, were made at each nucleotide site to generate a simulated genome of the same length as the original genome. This is analogous to rolling a five-sided die at each nucleotide to produce a pseudorandom strand with approximately the same nucleotide ratio as the original organism. Second, a first-order Markov chain model was used. Here, conditional on the value of the previous nucleotide, one of five different multinomial distributions was used to generate the simulated strand. Figuratively, we have five different five-sided dice, one each for the A, C, G, T, or N that occupies the previous nucleotide. If the previous nucleotide is an A, then the computer selects the five-sided die corresponding to A and "rolls it" to generate the current nucleotide, where the proportions of A, C, G, T, and N on A's die are the empirical proportions of AA, AC, AG, AT, and AN on the original genome. Thus, we are creating a pseudorandom strand with a first-order Markov model using the original genome's empirical, one-step transition probabilities (e.g. Ewens and Grant 2001).
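Both controls can be sketched with the Python standard library. The function names are hypothetical, and two details are assumptions the text does not specify: the first base of the Markov strand is drawn from the overall base composition, and a base with no observed successor falls back to that same composition.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ACGTN"

def multinomial_control(genome, rng):
    """One 'five-sided die' for every site, weighted by base composition."""
    counts = Counter(genome)
    weights = [counts[b] for b in ALPHABET]
    return "".join(rng.choices(ALPHABET, weights=weights, k=len(genome)))

def markov_control(genome, rng):
    """First-order Markov strand: one die per preceding base."""
    transitions = defaultdict(Counter)
    for prev, cur in zip(genome, genome[1:]):
        transitions[prev][cur] += 1          # empirical one-step transitions
    composition = Counter(genome)
    # Assumption: the first base is drawn from the overall composition.
    out = [rng.choices(ALPHABET, weights=[composition[b] for b in ALPHABET])[0]]
    while len(out) < len(genome):
        row = transitions[out[-1]]
        if not row:                          # assumption: fall back if no successor seen
            row = composition
        out.append(rng.choices(ALPHABET, weights=[row[b] for b in ALPHABET])[0])
    return "".join(out)

rng = random.Random(0)                       # seeded only to make the sketch repeatable
genome = "ACGT" * 50
print(len(multinomial_control(genome, rng)), len(markov_control(genome, rng)))  # 200 200
```

The multinomial strand preserves only base composition, whereas the Markov strand also preserves dimer frequencies, which is why the two controls can behave so differently under a tetramer-based classifier.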

CART (Classification and Regression Trees)
Classification trees are decision trees, the purpose of which is to determine a set of logical conditions that provide classification of cases based on the values of a set of classifier variables. One advantage of classification trees is that they are non-parametric, assuming neither a linear nor even a continuous, nonlinear relationship between the classifiers and the dependent variable. Another important advantage is their simplicity. Classification trees often yield simpler models than their parametric counterparts, as well as straightforward classification of new cases. In this paper, archaeal and bacterial genomes are classified on the basis of the intra-genomic relative frequencies of tetramers. We use the freely available rpart implementation of CART (Breiman et al. 1984) in the R statistical package (R Development Core Team 2006). All trees were generated so that any node with nine or fewer observations was left as a leaf (i.e. not split) and each leaf contains at least 3 observations.
	To begin, the intra-genomic tetranucleotide or tetramer frequencies of all 195 genomes were used to build a CART model to classify genomes according to temperature range: hyperthermophile vs. non-hyperthermophile (thermophile and mesophile) and mesophile vs. non-mesophile (hyperthermophile and thermophile). The validity of these CART models as a basis for predicting temperature ranges of genomes not included in the model was then tested by leaving out one genome at a time and building a decision tree using the other 194 genomes. The resulting model was then used to classify the omitted genome. The process of predicting the one genome left out (referred to as "leave one out" in the results section) based on a model using the other 194 genomes was repeated for each of the 195 genomes. As a third test of the external predictive validity of the CART models, 50 randomly selected genomes were set aside and a classification tree was built using the remaining 145 genomes. The temperature range for each of the 50 omitted genomes was then predicted. Building a model on 145 genomes and predicting 50 randomly selected genomes was repeated over 40 iterations.
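To make the stopping rules concrete, the sketch below searches for the best single binary split at one node. It is written in Python purely for illustration and is not the rpart code used in the study. The Gini criterion, the function names, and the toy data are assumptions; the two limits are a reading of the rules stated above (in rpart terms they appear to correspond to `minsplit = 10` and `minbucket = 3`).

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(rows, labels, min_split=10, min_bucket=3):
    """Find the (feature, threshold, impurity) of the best binary split.

    rows: list of dicts mapping tetramer -> relative frequency.
    Returns None when the node has fewer than min_split observations or
    when no split leaves at least min_bucket observations on each side.
    """
    n = len(rows)
    if n < min_split:
        return None                      # nine or fewer observations: keep as a leaf
    best = None
    for feature in rows[0]:
        values = sorted({r[feature] for r in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2    # candidate cut between adjacent values
            left = [y for r, y in zip(rows, labels) if r[feature] < threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] >= threshold]
            if len(left) < min_bucket or len(right) < min_bucket:
                continue                 # every leaf needs at least min_bucket cases
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

# Toy data (invented): GAGA separates the classes, AAAA does not.
rows = [{"GAGA": 0.001 * i, "AAAA": 0.01 * (i % 2)} for i in range(1, 13)]
labels = [0] * 6 + [1] * 6
print(best_split(rows, labels)[0])  # GAGA
```

A full tree applies the same search recursively to each child node until no further split is allowed.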
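The two validation schemes can be outlined as follows. To keep the sketch self-contained, a one-feature threshold rule (a decision stump placed midway between the class means) stands in for the CART model; that substitution, the function names, and the toy data are all assumptions made for illustration.

```python
import random

def fit_stump(xs, ys):
    """Stand-in classifier: threshold midway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return (mean_pos + mean_neg) / 2, mean_pos > mean_neg

def predict_stump(model, x):
    threshold, high_is_positive = model
    return (x > threshold) == high_is_positive

def leave_one_out_errors(xs, ys):
    """Refit on n-1 cases, predict the omitted case, and count the misses."""
    errors = 0
    for i in range(len(xs)):
        model = fit_stump(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors += predict_stump(model, xs[i]) != ys[i]
    return errors

def repeated_holdout_errors(xs, ys, n_test, n_iter, rng):
    """Average number of misses when n_test random cases are set aside."""
    indices = list(range(len(xs)))
    misses = []
    for _ in range(n_iter):
        test = set(rng.sample(indices, n_test))
        train = [i for i in indices if i not in test]
        model = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        misses.append(sum(predict_stump(model, xs[i]) != ys[i] for i in test))
    return sum(misses) / n_iter

# Toy, well-separated data (invented values).
xs = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
ys = [False] * 10 + [True] * 10
print(leave_one_out_errors(xs, ys))  # 0
```

With the study's settings the second function would be called with `n_test=50` and `n_iter=40` on the 195 genomes.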
Additional controls 
Four additional controls were added to highlight the limits and versatility of CART analyses. First, the intra-genomic tetramer frequencies of all 195 original genomes were used to build a CART model to classify genomes according to temperature range and these CART models were used to classify the temperature range of the randomly generated genomes (both multinomial distribution and first-order Markov). Second, the temperature ranges for all 195 original genomes were permuted and randomly reassigned to all 195 organisms. Then 100 iterations of the "leave one out" analysis were run. Third, we used the GC ratios of each genome (Table 1) as the sole classifier in a CART analysis to classify microbial temperature regimes. Finally, the intra-genomic relative frequencies of tetramers were used to discriminate archaea from bacteria to demonstrate the versatility of this technique as applied to a pervasive genomic signature.
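The permutation control amounts to shuffling the class labels while leaving the tetramer frequencies untouched, so that class sizes are preserved but any link between signature and temperature range is broken. A minimal sketch, with a hypothetical function name:

```python
import random

def permute_labels(labels, rng):
    """Return a shuffled copy of the labels; class counts are unchanged."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

labels = ["hyper"] * 16 + ["non-hyper"] * 179
permuted = permute_labels(labels, random.Random(0))
print(permuted.count("hyper"), permuted.count("non-hyper"))  # 16 179
```

Each of the 100 iterations would draw a fresh permutation and then rerun the leave-one-out procedure on the relabelled genomes.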

Results
CART with Genomic Signatures: an accurate discriminator of Temperature Classifications
There were relatively few misclassification errors from the CART models and from tests of those models. Many of the misclassifications are discussed in further detail below because they may actually represent problems with consistent and uniform reporting of temperature optima and classifications rather than errors in the CART analyses. That is, CART seems to have caught several instances of organisms whose temperature optima either fail to reflect their reported temperature classifications or lie on the border between two classifications. This parallels the experience of van Passel et al. (2006a) who, based on their confidence in the accuracy of genomic signatures, speculated that some "misclassified" bacteria in their study might actually have been correctly classified by their method. Some of the challenges of temperature classification are discussed further below. In the current discussion, the criteria of the latest edition of The Prokaryotes (Dworkin 1999) were followed: Hyperthermophile > 80C, Thermophile 60-80C, Mesophile 15-60C, Psychrophile < 15C.
	The classification tree of the temperature ranges of hyperthermophiles vs. non-hyperthermophiles (the set of thermophiles and mesophiles) was almost error-free. Two of sixteen hyperthermophiles, Carboxydothermus hydrogenoformans and Thermotoga maritima, appeared at first to be misclassified by a tree designed to distinguish them from non-hyperthermophiles (Figure 1).  However, both of these species may not be true hyperthermophiles.  At NCBI, C. hydrogenoformans has a contradictory listing as a hyperthermophile with a temperature optimum of 78C; by the current definition in The Prokaryotes (Dworkin 1999), it should be placed in the thermophiles, as confirmed in this CART analysis. Thermotoga maritima is listed with a temperature optimum of 80C placing it directly on the boundary between temperature classifications.  All 179 non-hyperthermophiles were correctly classified (Figure 1).   
	The classification tree based on the tetramer frequencies over the virtual coding strands of all 195 genomes for mesophiles versus non-mesophiles (the set of hyperthermophiles and thermophiles) resulted in a total of nine misclassifications. Only two out of 165 mesophiles, Geobacter metallireducens and Geobacter sulfurreducens, were misclassified (Figure 2). Seven out of 30 non-mesophiles were misclassified. These included the hyperthermophile Nanoarchaeum equitans, an archaeon exceptional in many ways: it is very small, has a tiny genome with an extreme AT bias (about 32% GC), is the only known parasitic or symbiotic archaeon with another archaeon (Ignicoccus) as a host, and is the sole member of its own phylum.
	Five of the "thermophiles" that at first seemed to be misclassified on the mesophile versus non-mesophile tree (see * in Figure 2) are listed here with their reported temperature optima: Chlorobium tepidum (48), Geobacillus kaustophilus (60), Methylococcus capsulatus (45), Streptococcus thermophilus (45) and Thermosynechococcus elongatus (55). All but G. kaustophilus fall well below the current bracket temperatures for thermophily, 60-80C.
	As explained in the materials and methods section, a "leave one out" analysis was then used to test the external validity of these models. Misclassified organisms, along with their temperature optima, are indicated in Table 1. In many cases, the CART analysis has revealed temperature optima that are incongruent with the NCBI-reported temperature classifications. In the 195 tests of models for hyperthermophiles versus non-hyperthermophiles, only 8 organisms were misclassified, including three not classified correctly in the original models discussed above: C. hydrogenoformans and Thermotoga maritima (which may not be true hyperthermophiles) and N. equitans (an unusual archaeon). The other five apparently misclassified organisms included the hyperthermophile Sulfolobus tokodaii with a borderline temperature optimum of 80C. They also included Sulfolobus acidocaldarius, reported with a 70-75C optimum at NCBI but reported elsewhere at 80C (Chen et al. 2005), suggesting that its designation as a thermophile is not yet settled (Table 1).
	In the "leave one out" testing of 195 mesophile versus non-mesophile models, there were 14 misclassifications. However, upon further scrutiny, several more ambiguities in reporting temperature classifications were confirmed or revealed. Also included was the unusual N. equitans, which may be a true outlier in any dataset. In addition, Picrophilus torridus seemed to be misclassified as a mesophile, but has a borderline optimum of 60C.
	When the predictive ability of the CART models was tested over 40 iterations, each setting aside 50 randomly selected genomes and building a classification tree from the other 145, an average of 2.6 out of 50 genomes (5.2%) were misclassified on the hyperthermophile vs. non-hyperthermophile trees. An average of 5.1 out of 50 genomes (10.2%) were misclassified on the mesophile vs. non-mesophile trees.
	The total number of misclassifications when building models using all the genomes ("model with all") and the total number of missed predictions in the "leave one out" and "50 randomly selected genomes" analyses are shown in Table 2.
	When genome sequences were randomized by a single multinomial at each nucleotide, the classifications had at least twice the number of misclassifications as the original experiments (Table 3). Misclassifications in genomes generated by a first-order Markov method were much closer to the original experiments. Indeed, in the hyperthermophile vs. non-hyperthermophile model, the error rate was identical, although the two analyses differed with respect to which organisms were misclassified. This suggests that considerable information exists in the dimer frequencies that comprise the tetramers, as confirmed when we ran experiments on dimers and found them slightly less accurate than tetramers (see Additional Materials, Table A1).
	When temperature ranges for the 195 genomes were permuted and randomly reassigned to the 195 organisms, the resulting models in the leave-one-out analyses yielded a substantial increase in the misclassification rate. When classifying hyperthermophiles vs. non-hyperthermophiles, the average misclassification rate was 14.4% (28.1 of 195 over 100 iterations). This is about the same misclassification rate one would expect when ignoring all concomitant information and blindly classifying each organism independently with probabilities equal to the relative frequencies of the classes in the sample. In this case, randomly classifying an organism as a non-hyperthermophile with probability .918 (179/195) or a hyperthermophile with probability .082 (16/195) yields an expected misclassification rate of 16.4%, 8.2% in each of the two categories. Similarly, permuting the temperature ranges, randomly reassigning them to the 195 organisms, and then classifying mesophiles vs. non-mesophiles yields an average misclassification rate of 27% (52.8 of 195 over 100 iterations), whereas random classification based solely on relative frequency yields an expected misclassification rate of 30.8% (Table 4). In both cases, the misclassification rates from random data are substantially higher than those from the leave-one-out analysis of the original data. For hyperthermophile vs. non-hyperthermophile, the misclassification rate more than tripled (4.1% using the classification tree, 14.4% using random assignment) and for mesophile vs. non-mesophile, the misclassification rate nearly quadrupled (7.2% using the classification tree, 27% using random assignment).
	GC percentage was not found to be an accurate predictor of temperature optima. This is confirmed by the poor performance (similar to "Multinomial" in Table 3) of CART in differentiating temperature optima when GC ratio is the sole classifier (Table 5). On the other hand, genomic signatures have a history of usefulness in showing divergent relationships. As expected, a CART analysis of tetramers performed well in separating archaea from bacteria (Table 6). Using the CART analysis, about 21% (5/24) of archaea were misclassified and no bacteria (0/171) were misclassified. Using a random classification scheme based on the relative frequencies of archaea and bacteria in the sample, we would expect about 87.7% (21/24) of archaea to be misclassified and about 12.3% (21/171) of bacteria to be misclassified. Again, the CART analysis yields a substantial improvement in misclassification rates.
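The random-assignment baselines quoted above can be checked with simple arithmetic. The sketch below assumes that each organism is labelled independently, with probability equal to each class's share of the sample; the function name is illustrative. Under that assumption the per-class figures for archaea and bacteria are reproduced (about 21 of 24 and 21 of 171), and the overall expected rate is 2p(1-p). For 16 hyperthermophiles among 195 genomes this comes to roughly 15%, and for 30 non-mesophiles roughly 26%, somewhat below the 16.4% and 30.8% quoted above, which equal twice the minority-class share; Table 4 remains the reference for the reported values.

```python
def random_assignment_rates(n_a, n_b):
    """Expected misclassification under independent, proportional labelling.

    Returns (rate among class A, rate among class B, overall rate).
    """
    n = n_a + n_b
    p_a, p_b = n_a / n, n_b / n
    miss_a = p_b                                   # a true A is labelled B
    miss_b = p_a                                   # a true B is labelled A
    overall = (n_a * miss_a + n_b * miss_b) / n    # equals 2 * p_a * p_b
    return miss_a, miss_b, overall

miss_archaea, miss_bacteria, _ = random_assignment_rates(24, 171)
print(round(miss_archaea, 3), round(miss_bacteria, 3))   # 0.877 0.123
print(round(24 * miss_archaea), round(171 * miss_bacteria))  # 21 21
```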
	Tetramers were chosen as the classifier after parallel sets of experiments run with dimers, trimers, and pentamers confirmed a slightly better overall rate of correct classification using tetramers (Table A1). In addition, for each motif size of length L={2, 3, 4, 5}, each genome was cross-validated using a model built with the other 194 genomes. The number of tested organisms that were predicted incorrectly was recorded for each experimental classification (hyper vs. non-hyper, meso vs. non-meso) (Table A1). Further, a tree was built using both trimers and tetramers, simultaneously, as potential classifiers, yielding results similar to those from fitting a tree with tetramers alone. In particular, GAGA continued to be chosen as the best, first-decision classifier, even when trimers were considered along with tetramers. Based on these experimental trials, we found tetramers to lead to the most effective classification scheme. 

Discussion
Challenges of Temperature Classifications for Microorganisms
The division of microorganisms into four categories based on optimal growth temperatures, psychrophile, mesophile, thermophile, and hyperthermophile, owes much to the fact that mesophiles are so well studied, especially symbionts or pathogens of mesophilic humans, thriving at the body temperatures of humans. Furthermore, the range of temperatures on the classification scale has been modified over the years as organisms at extreme temperatures have not only been discovered but also found to be common, at least in certain habitats.  For example, the first edition of The Prokaryotes (Brock 1981) predates the confirmation of vast communities of microorganisms at boiling temperatures and indicates typical "high" temperature optima to be mostly between 50 and 70C. Extremes such as an isolate at 85C or activity at 92C were considered exceptional and the term "hyperthermophile" had not yet been coined (Brock 1981). The canonical temperature, 60C, at which many well-studied mesophilic proteins are denatured in the lab may also have influenced the early analyses of optimal temperatures.
	The present system of temperature classification is by no means settled and has some hallmarks of having been decided by a committee. New input includes discoveries of microbes at extreme temperatures and attempts to even out the temperature intervals. Meanwhile, the previous temperature classification system (still the template for the new one) is strongly biased toward mesophilic pathogens.
	Several species of archaea and bacteria are reported at NCBI, using somewhat different criteria, most likely reflecting the former system of temperature ranges for heat-loving microbes mostly found in hot compost, around 50C. Some of these are discussed in a previous section and noted in Table 1 and Figures 1 and 2 as being among the "misclassified" by the CART analyses. An additional challenge is the disproportionate number of species reported at NCBI at the exact cut-off temperature of 60 or 80C (just qualifying for the realm of thermophily or hyperthermophily by the latest definitions), suggesting the possibility of rounding. There are weak points with any classification system; in this case the attempt to divide along an equal (perhaps arbitrary) scale, with increments of 20C, may not reflect the actual diversity of the organisms. These observations are confirmed in part by Lobry and Chessel (2003), who also noticed discrepancies, or outliers in the range of 60C, in their mostly successful attempts to classify thermophiles via codon usage.

Genome sequence convergences at high temperatures 
Some physical properties of DNA may explain the CART classifications by temperature. These include correlations between nucleotide composition (e.g. AT percent and purine percent (loading index)), DNA melting temperatures, and mRNA stability. 
	The AT content of a genome has a linear correlation with the melting temperature of that genome in vitro (e.g. Ussery 2001). Furthermore, the temperature scale for DNA melting experiments typically runs from 70 or 75C to 100C. That is, 70-75C seems to be a significant temperature for any DNA in vitro, no matter what its AT content. Therefore, organisms dwelling at 70-75C or more may be encountering (and presumably overcoming) greater challenges in keeping their genomes intact than organisms at cooler temperatures. Such a dichotomy may be reflected in the CART classification, which fails to discriminate clearly between some mesophiles and thermophiles as defined with a cut-off temperature of 60C.
	Indeed, in many hyperthermophiles, an AT content less than 50% corresponds with higher temperature optima. However, there is no reliable rule connecting AT content and temperature optima. Among the exceptions are some of the lowest AT percentages, which are found in various mesophilic symbionts or pathogens. Meanwhile, Thermus thermophilus at 68C has a rather high GC content of 69.4%, that is, an AT content of only 30.6% (Table 1).
	Degradation of DNA at high temperatures may be more significant than melting (Marguet and Forterre 1994, 1998). Supercoiled, circular DNA seems to be relatively protected from melting, even at boiling temperatures, although degradation of DNA at those temperatures is then followed by melting of the vulnerable linear fragments. Organisms living at high temperatures may use protective mechanisms such as supercoiling, and this may explain why AT percent is not a reliable predictor of thermophily. Note that this does not necessarily preclude a set of adaptations commencing at 70C in which greater protective measures such as supercoiling prevent degradation and then melting. Such adaptations might well be reflected in genomic sequences (e.g. those that enhance coiling) and therefore in genomic signatures of hyperthermophiles.
	Understanding the genomic signatures of hyperthermophiles may be further complicated by the purine-loading index, the difference in numbers of purines (A and G) relative to pyrimidines (T and C) in a coding strand. A coding region composition with higher numbers of purines is correlated with a minimization of undesirable mRNA-mRNA interactions. Such entropy-driven interactions increase at higher temperatures and therefore thermophilic organisms may tend to have higher purine loading indexes, specifically by acquiring A and losing C, especially in coding regions (Lao and Forsdyke 2000).
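As a rough illustration of the purine-loading index described above, the following Python sketch counts purines and pyrimidines on a coding strand. The function name and the scaling to a per-kilobase value are assumptions made for this example, not details taken from this study.

```python
def purine_loading_index(coding_strand, per=1000):
    """Excess of purines (A, G) over pyrimidines (C, T), scaled per `per` bases.

    The per-kilobase scaling is an assumption of this sketch. Characters
    other than A, C, G, and T (e.g. N) are ignored.
    """
    seq = coding_strand.upper()
    purines = seq.count("A") + seq.count("G")
    pyrimidines = seq.count("C") + seq.count("T")
    total = purines + pyrimidines
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return per * (purines - pyrimidines) / total


# Hypothetical toy sequences: a purine-only tract and a balanced one.
print(purine_loading_index("GAGAGAGA"))
print(purine_loading_index("ACGT"))
```

A positive value indicates a purine-rich coding strand; a tract such as GAGA contributes only purines, which is consistent with the suggestion that purine-loading may underlie the tetramers selected by CART.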
	There were just a few discriminating tetramers in nearly every one of the 195 CART models generated for each of the temperature comparisons. The predominant tetramer in the models of hyperthermophile vs. non-hyperthermophile was GAGA, used in 192 out of 195 trees as the initial split parameter; the remaining three models used the tetramer AGAG. In all cases, a higher proportion of these tetramers was associated with the hyperthermophilic classification. In the models of mesophilic vs. non-mesophilic, AGGA was the primary split parameter in 193 cases; the remaining two were split on GAGA. In this case, a higher proportion of these tetramers was associated with the non-mesophilic classification. The most commonly used tetramers in secondary splits were ATCA (for hyperthermophile vs. non-hyperthermophile) and GGGA and AACC (for mesophile vs. non-mesophile). As has been noted by other authors (e.g. Lao and Forsdyke 2000, Lambros et al. 2003, and Paz et al. 2004), many thermophilic genomes are purine-loaded in their coding regions and possess codon biases that reflect that loading. Our method of using a "virtual coding strand" eliminated the variable of gene orientation. This may accentuate the signal from such codon biases in nucleotide composition by assuring that all genic regions (and therefore all codons) are counted in the sense direction.
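The idea of counting every gene in the sense direction can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the coordinate convention (0-based, half-open intervals with a '+' or '-' strand) are hypothetical, only genic regions are concatenated, and the actual construction used in this study may differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def virtual_coding_strand(genome, genes):
    """Concatenate genes so that each one reads in its sense direction.

    `genes` is a list of (start, end, strand) tuples; genes annotated on the
    '-' strand are reverse-complemented before being appended.
    """
    parts = []
    for start, end, strand in genes:
        fragment = genome[start:end]
        parts.append(fragment if strand == "+" else reverse_complement(fragment))
    return "".join(parts)


# Hypothetical toy genome: one gene on each strand, both reading GAGA in sense.
toy_genome = "TCTCAAAAGAGA"
toy_genes = [(0, 4, "-"), (8, 12, "+")]
print(virtual_coding_strand(toy_genome, toy_genes))
```

In the toy genome the minus-strand gene appears as TCTC on the top strand but contributes GAGA once oriented, so a purine-rich codon bias is not split between a tetramer and its reverse complement.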
	The structures and activities of protein sequences at high temperatures are also relevant to the classification of heat-loving microbes. Codon biases and amino acid biases, which in turn affect codon composition, have an effect on coding sequences. Such biases have been correlated with temperature (e.g. Lobry and Chessel 2003). However, temperature constraints on various protein functions affect codon use as well. Thus, an unambiguous set of rules for high temperature codons remains difficult to decipher.


Conclusion
Genomic signatures, as represented by intra-genomic relative frequencies of tetramers and analyzed by CART, yielded effective discriminators of temperature optima in archaea and bacteria. The implications of these results extend beyond the examples examined for this study. CART may be used in explorations of a variety of hypotheses regarding genomic sequences. There may be practical applications, including the identification of temperature optima of partial genomes such as in some metagenomic analyses of communities. Furthermore, this approach may yield a more nuanced understanding of the phylogenies of high temperature organisms and may suggest explorations of other pressures on genomic convergences such as high salt conditions. Since CART uses a few signature oligonucleotides to construct trees, it may lead to identification of particular traits associated with those nucleotides. The method may also be seen as an in silico complement to genomic signature tags (Dunn et al. 2002) by which genomes are sampled for a subset of fragments reflecting the whole. CART queries also may reveal overrepresented sequences of some functional significance, as in binding sites of non-coding regions.
	The CART method is an indexing tool. As noted by many professional indexers (or cataloguers) of books, an index (or catalogue) is limited only by the interests, assumptions, and imagination of the indexer. An index, whether of a book or a genome, is the creative product bearing the imprint of the indexer. On the topic of the London Library catalogue system, A.S. Byatt (2001) wrote:
 "It represents order-it is helpful, it leads you to what you were trying to find, and also to what you needed, but did not know you needed to find. It also has the delightfully mad quality of heterogeneous things linked violently together by the arbitrary order of the alphabet."

Acknowledgements
We thank Robert Obar, Harvard Medical School, and Shawn McCafferty, Biology Department, Wheaton College, for their insights on portions of this work and Greg Williams for his database expertise. We thank the reviewers for their valuable input to the manuscript. Partial support for this work was provided by the National Science Foundation's Course, Curriculum, and Laboratory Improvement program (CCLI-EMD) under grant 0340761 and by Susanne Woods and Molly Easo Smith of the Wheaton College Provost Office.


References
Abe, T., S. Kanaya, M. Kinouchi, Y. Ichiba, T. Kozuki, and T. Ikemura. 2003. Informatics for unveiling hidden genome signatures. Genome Res. 13:693-702.
Abe, T., H. Sugawara, M. Kinouchi, S. Kanaya, and T. Ikemura. 2005. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 12:281-290.
Abe, T., H. Sugawara, S. Kanaya, M. Kinouchi, and T. Ikemura. 2006. Self-Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes. Gene 365:27-34.

Boulesteix, A.L., G. Tutz, and K. Strimmer. 2003. A CART-based approach to discover emerging patterns in microarray data. Bioinformatics 19(18):2465-2472.
Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. 1984. Classification and Regression Trees. Chapman and Hall, New York, NY. (368 pages).
Brock, T. 1981. Extreme thermophiles of the genera Thermus and Sulfolobus. In The Prokaryotes (Eds. Starr, M.P., H. Stolp, H.G. Truper, A. Balows, and H.G. Schlegel). Springer-Verlag, Berlin, NY, 978-989.

Burge, C., A.M. Campbell, and S. Karlin. 1992. Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. U.S.A. 89:1358-1362.

Byatt, A.S. 2001. Indexers and Indexes in Fact and Fiction (Ed: Hazel Bell). University of Toronto Press, 11-16.

Cambillau, C. and J.M. Claverie. 2000. Structural and genomic correlates of hyperthermostability. J. Biol. Chem. 275:32383-32386.

Campbell, A., J. Mrazek, and S. Karlin. 1999. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl. Acad. Sci. U.S.A. 96:9184-9189.

Carbone, A., F. Kepes, and A. Zinovyev. 2005. Codon bias signatures, organization of microorganisms in codon space, and lifestyle. Mol. Biol. Evol.  22:547-561.

Chargaff, E., E. Vischer, R. Doniger, C. Green, and F. Misani. 1949. The composition of desoxypentose nucleic acids of thymus and spleen. J. Biol. Chem. 177:405-416.

Chargaff, E. 1963. Essays on Nucleic Acids. Elsevier, New York, NY (211 pages).

Chen, L., K. Brugger, M. Skovgaard, P. Redder, Q. She, E. Torarinsson, B. Greve, M. Awayez, A. Zibat, H.P. Klenk, and R.A. Garrett. 2005. The genome of Sulfolobus acidocaldarius, a model organism of the Crenarchaeota. J. Bacteriol. 187:4992-4999.

De'ath, G. 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83(4):1105-1117.
Doolittle, R.F. 1994. Convergent evolution: the need to be explicit.  Trends Biochem. Sci. 19:15-18.

Dufraigne, C., B. Fertil, S. Lespinats, A. Giron, and P. Deschavanne. 2005. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res. 33:1-14.

Dunn, J., S. McCorkle, L. Praissman, G. Hind, D. van der Lelie, W. Bahou, D. Gnatenko, and M. Krause. 2002. Genomic Signature Tags (GSTs): A system for profiling genomic DNA. Genome Res. 12:1756-1765.

Dworkin, M. (Ed. in chief) 1999. The Prokaryotes: an evolving electronic resource for the microbiological community. Springer-Verlag, New York, NY.

Ewens, W. J. and G. R. Grant. 2001. Statistical Methods in Bioinformatics: An Introduction. Springer-Verlag, New York, NY, 303-310.

Fadiel, A., S. Lithwick, G. Ganji, and S.W. Scherer. 2003. Remarkable sequence signatures in archaeal genomes. Archaea 1:185-190.

Fertil, B., M. Massin, S. Lespinats, C. Devic, P. Dumee, and A. Giron. 2005. GENSTYLE:  exploration and analysis of DNA sequences with genomic signature. Nucleic Acids Res. 33:W512-W515.

Foerstner, K.U., C. von Mering, S.D. Hooper, and P. Bork. 2005. Environments shape the nucleotide composition of genomes. EMBO Rep. 6(12):1208-1213.

Graham, D.E., R. Overbeek, G.J. Olsen, and C.R. Woese. 2000. An archaeal genome signature. Proc. Natl. Acad. Sci. U.S.A. 97:3304-3308.

Hermanek, P. and I. Guggenmoos-Holzmann. 1994. Classification and regression trees (CART) for estimation of prognosis in patients with gastric carcinoma. J. Cancer Res. and Clinical Oncology. 120(5):309-313.

Jernigan, R.W. and R.H. Baran. 2002. Pervasive properties of the genomic signature. BMC Bioinformatics 3:1-12.

Karlin, S. 2001. Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol. 9:335-343.

Karlin, S. and C. Burge. 1995. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:283-290.

Karlin, S. and I. Ladunga. 1994. Comparisons of eukaryotic genomic sequences. Proc. Natl. Acad. Sci. U.S.A. 91:12832-12836.

Karlin, S., I. Ladunga, and B.E. Blaisdell. 1994. Heterogeneity of genomes: measures and values. Proc. Natl. Acad. Sci. U.S.A. 91:12837-12841.

Karlin, S. and J. Mrazek. 1996. What drives codon choices in human genes?  J. Mol. Biol. 262:459-472. 

Karlin, S., J. Mrazek, and A. Campbell. 1997. Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol. 179:3899-3913.

Kawashima, T., N. Amano, H. Koike, S. Makino, Y. Higuchi, Y. Kawashima-Ohya, K. Watanabe, M. Yamazaki, K. Kanehori, Y. Kawamoto, H. Aramaki, K. Makino, and M. Suzuki. 2000. Archaeal adaptation to higher temperatures revealed by genomic sequence of Thermoplasma volcanium. Proc. Natl. Acad. Sci. U.S.A. 97:14257-14262.

Kohonen, T. 1990. The self-organizing map. Proc. IEEE 78:1464-1480.

Lambros, R.J., J.R. Mortimer, and D.R. Forsdyke. 2003. Optimum growth temperature and base composition of open reading frames in prokaryotes. Extremophiles 7:443-450.

Lao, P. and D. Forsdyke. 2000. Thermophilic bacteria strictly obey Szybalski's Transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res. 10:228-236.

Lin, S., S. Patel, A. Duncan, and L. Goodwin. 2003. Using decision trees and support vector machines to classify genes by names. Proceedings of the European Workshop on Data Mining and Text Mining for Bioinformatics, 35-41.
Lio, P. 2002. Investigating the relationship between genome structure, composition, and ecology in prokaryotes. Mol. Biol. Evol. 19:789-800.

Lobry, J.R. and D. Chessel. 2003. Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria. J. Appl. Genet. 44:253-261.

Marguet, E. and P. Forterre. 1994. DNA stability at temperatures typical for hyperthermophiles. Nucleic Acids Res. 22:1681-1686.

Marguet, E. and P. Forterre. 1998. Protection of DNA by salts against thermodegradation at temperatures typical for hyperthermophiles. Extremophiles 2:115-122.

Masic, N., A. Gagro, S. Rabati, A. Sabioncello, G. Dai, B. Jaki, and B. Vitale. 1998. Decision-tree approach to the immunophenotype-based prognosis of the B-cell chronic lymphocytic leukemia. American J. of Hematology 59(2):143-148.

McHardy, A.C., H.G. Martin, A. Tsirigos, P. Hugenholtz, and I. Rigoutsos. 2007. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4:63-72.

Paz, A., V. Kirzhner, E. Nevo, and A. Korol. 2006. Coevolution of DNA-interacting proteins and genome "dialect". Mol. Biol. Evol. 23:56-64.

Paz, A., D. Mester, I. Baca, E. Nevo, and A. Korol. 2004. Adaptive role of increased frequency of polypurine tracts in mRNA sequences of thermophilic prokaryotes. Proc. Natl. Acad. Sci. U.S.A. 101:2951-2956.

Pride, D.T., R.J. Meinersmann, T.M. Wassenaar, and M.J. Blaser. 2003. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 13:145-158.

R Development Core Team. 2006. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org.

Sandberg, R., G. Winberg, C.I. Branden, A. Kaske, I. Ernberg, and J. Coster. 2001. Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res. 11:1404-1409.

Saunders, N.F.W., T. Thomas, P.M.G. Curmi, J.S. Mattick, E. Kuczek, R. Slade, J. Davis, P.D. Franzmann, D. Boone, K. Rusterholtz, R. Feldman, C. Gates, S. Bench, K. Sowers, K. Kadner, A. Aerts, D. Paramvir, C. Detter, T. Glavina, S. Lucas, P. Richardson, F. Larimer, L. Hauser, M. Land, and R. Cavicchioli. 2003. Mechanisms of thermal adaptation revealed from the genomes of the antarctic archaea Methanogenium frigidum and Methanococcoides burtonii. Genome Res. 13:1580-1588.

Suhre, K. and J.M. Claverie. 2003. Genomic correlates of hyperthermostability, an update. J. Biol. Chem. 278:17198-17202.

Teeling, H., A. Meyerdierks, M. Bauer, R. Amann, and F.O. Glockner. 2004. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ. Microbiol. 6:938-947.

Tsirigos, A. and I. Rigoutsos. 2005. A sensitive, support-vector-machine method for the detection of horizontal gene transfers in viral, archaeal and bacterial genomes. Nucleic Acids Res. 33:3699-3707.

Ussery, D.W. 2001. DNA denaturation. Encyclopedia of Genetics. Academic Press, New York, NY, 550-553.

van Passel, M.W.J., E.E. Kuramae, A.C.M. Luyf, A. Bart, and T. Boekhout. 2006a. The reach of the genome signature in prokaryotes. BMC Evolutionary Biology. 6:1-27.

van Passel, M.W.J., A. Bart, A.C.M. Luyf, A.H.C. van Kampen, and A. van der Ende. 2006b. Compositional discordance between prokaryotic plasmids and host chromosomes. BMC Genomics. 7:26.

Vasilevskaya, V.V., L.V. Gusev, and A.R. Khokhlov. 2006. Protein sequences as literature text. Macromol. Theory Simul. 15:425-431.

Woyke, T., H. Teeling, N.N. Ivanova, M. Huntemann, M. Richter, F.O. Gloeckner, D. Boffelli, I.J. Anderson, K.W. Barry, H.J. Shapiro, E. Szeto, N.C. Kyrpides, M. Mussmann, R. Amann, C. Bergin, C. Ruehland, E.M. Rubin, and N. Dubilier. 2006. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature 443:950-955.

Xia, X., T. Wei, Z. Xie, and A. Danchin. 2002. Genomic changes in nucleotide and dinucleotide frequencies in Pasteurella multocida cultured under high temperature. Genetics 161(4):1385-1394.









Figure 1.  Classification tree of Hyperthermophile (Hyper) vs. Non-Hyperthermophile (Mesophile and Thermophile, MesoTherm) based on tetramer relative frequencies of 195 genomes. Each split shows the tetramer selected by CART for each level of classification and the relative frequencies for each direction. Nodes show the classified temperature range (MesoTherm or Hyper) and number of organisms classified in each category (MesoTherm/Hyper). Organisms misclassified in this model are listed under the appropriate nodes.










Figure 2.  Classification tree of Mesophile (Meso) vs. Non-Mesophile (Hyperthermophile and Thermophile, NonMeso) based on tetramer relative frequencies of 195 genomes. Splits show the tetramer selected by CART for each level of classification and the relative frequencies for each direction. Nodes show the classified temperature range (Meso or NonMeso) and number of organisms classified in each category (NonMeso/Meso). Organisms misclassified in this model are listed under the appropriate nodes. * Indicates organisms that may not be misclassified given current understanding of temperature optima (see Discussion).


Table 1. List of 16 Hyperthermophilic (12 Archaea, 4 Bacteria) and 14 Thermophilic (5 Archaea, 9 Bacteria) organisms with their NCBI-reported optimal growth temperature and GC content. Organism names are listed with the NCBI filename convention of Genus species strain. An additional 165 Mesophilic organisms (7 Archaea, 158 Bacteria) were also included. An x marks a leave-one-out misprediction.
                                                                                              Leave one out mispredictions
Kingdom   Temperature Range      Optimal Growth  GC%   Organism                               Hyper vs.   Meso vs.
          (as reported at NCBI)  Temp (C)                                                     Non-Hyper   Non-Meso
Archaea   Hyperthermophilic      103             42.0  Pyrococcus abyssi                      -           -
Archaea   Hyperthermophilic      100             52.0  Pyrobaculum aerophilum                 -           -
Archaea   Hyperthermophilic      100             42.0  Pyrococcus furiosus                    -           -
Archaea   Hyperthermophilic      98              60.0  Methanopyrus kandleri                  x           -
Archaea   Hyperthermophilic      98              42.0  Pyrococcus horikoshii                  -           -
Bacteria  Hyperthermophilic      96              43.0  Aquifex aeolicus                       -           -
Archaea   Hyperthermophilic      93              67.0  Aeropyrum pernix                       -           -
Archaea   Hyperthermophilic      90              31.6  Nanoarchaeum equitans                  x           x
Archaea   Hyperthermophilic      85              31.3  Methanococcus jannaschii               -           -
Archaea   Hyperthermophilic      85              35.8  Sulfolobus solfataricus                -           -
Archaea   Hyperthermophilic      85              52.0  Thermococcus kodakaraensis KOD1        -           -
Archaea   Hyperthermophilic      83              46.0  Archaeoglobus fulgidus                 -           -
Archaea   Hyperthermophilic      80              32.8  Sulfolobus tokodaii                    x           -
Bacteria  Hyperthermophilic      80              45.0  Thermotoga maritima                    x           -
Bacteria  Hyperthermophilic      78              42.0  Carboxydothermus hydrogenoformans      x           x
Bacteria  Hyperthermophilic      75              37.6  Thermoanaerobacter tengcongensis       -           x
Archaea   Thermophilic           72              36.7  Sulfolobus acidocaldarius DSM 639      x           -
Bacteria  Thermophilic           68              69.4  Thermus thermophilus HB27              -           -
Archaea   Thermophilic           65              49.5  Methanobacterium thermoautotrophicum   -           -
Bacteria  Thermophilic           60              52.0  Geobacillus kaustophilus HTA426        -           x
Archaea   Thermophilic           60              36.0  Picrophilus torridus DSM 9790          -           x
Bacteria  Thermophilic           60              68.7  Symbiobacterium thermophilum IAM14863  -           -
Archaea   Thermophilic           60              50.0  Thermoplasma volcanium                 -           -
Archaea   Thermophilic           59              50.0  Thermoplasma acidophilum               -           -
Bacteria  Thermophilic           58              55.8  Moorella thermoacetica ATCC 39073      -           -
Bacteria  Thermophilic           55              53.9  Thermosynechococcus elongatus          -           x
Bacteria  Thermophilic           52              67.5  Thermobifida fusca YX                  -           -
Bacteria  Thermophilic           48              56.0  Chlorobium tepidum TLS                 -           x
Bacteria  Thermophilic           45              63.6  Methylococcus capsulatus Bath          -           x
Bacteria  Thermophilic           45              40.0  Streptococcus thermophilus CNRZ1066    -           x




Table 2.  The number of misclassifications for two experimental groupings of temperature range (Hyper vs. Non-Hyper and Meso vs. Non-Meso) using all tetramer frequencies over the entire virtual coding strand.
Model with All: Number of misclassifications when building a model using all 195 genomes
Leave One Out Prediction: Number of missed predictions when training on 194 genomes and attempting to classify the one left out
Predict 50 Randomly Selected Genomes: Average number (over 40 iterations) of misclassifications when training on 145 genomes and attempting to classify 50 randomly selected genomes

Temperature Range Grouping  Model with All  Leave One Out Prediction  Predict 50 Randomly Selected Genomes
Hyper vs. Non-Hyper         2 (1.0%)        8 (4.1%)                  2.6 (5.2%)
Meso vs. Non-Meso           9 (4.6%)        14 (7.2%)                 5.1 (10.2%)

Leave One Out Prediction, Hyper vs. Non-Hyper: of 8 missed, 5/16 were Hypers, 1/14 Therms, and 2/165 Mesos
Leave One Out Prediction, Meso vs. Non-Meso: of 14 missed, 3/16 were Hypers, 6/14 Therms, and 5/165 Mesos





Table 3.  The number of incorrect predictions of individual genomes for two experimental groupings of temperature range (Hyper vs. Non-Hyper and Meso vs. Non-Meso) using all tetramer frequencies over the entire virtual coding strand for 195 original genomes and two types of randomly generated control sequences.
Leave One Out Prediction: Number of missed predictions

Temperature Range Grouping  Original Genomes  Random Control (Multinomial)  Random Control (Markov)
Hyper vs. Non-Hyper         8 (4.1%)          16 (8.2%)                     8 (4.1%)
Meso vs. Non-Meso           14 (7.2%)         59 (30.3%)                    21 (10.8%)



Table 4. Classification results (number of missed predictions) when the temperature ranges for all 195 original genomes were permuted and randomly reassigned to all 195 organisms.
Leave One Out Prediction: Average number of missed predictions over 100 iterations when training on 194 genomes and attempting to classify the one left out

Control                               Hyper vs. Non-Hyper  Meso vs. Non-Meso
Randomly permuted temperature ranges  28.1 (14.4%)         52.8 (27.0%)




Table 5. Classification results (number of misclassified and missed predictions) using GC percentage to model and predict temperature range. 
Model with All: Number of misclassifications when building a model using GC% of all 195 genomes
Leave One Out Prediction: Number of missed predictions when training on GC% of 194 genomes and attempting to classify the one left out

Control  Model with All                          Leave One Out Prediction
         Hyper vs. Non-Hyper  Meso vs. Non-Meso  Hyper vs. Non-Hyper  Meso vs. Non-Meso
GC%      15 (7.7%)            27 (13.8%)         23 (11.8%)           42 (21.5%)




Table 6. Classification results (number of misclassified and missed predictions) using tetramers to model and predict Kingdom (Archaea vs. Bacteria).
Model with All: Number of misclassifications when building a model using 4mers of all 195 genomes
Leave One Out Prediction: Number of missed predictions when training on 4mers of 194 genomes and attempting to classify the one left out

Control                                          Model with All  Leave One Out Prediction
Archaea vs. Bacteria (24 Archaea, 171 Bacteria)  5 (2.6%)        15 (7.7%)

Model with All: all 5 wrong were Archaea
Leave One Out Prediction: 9/15 wrong were Archaea, 6/15 wrong were Bacteria



Additional Materials (to appear on a website if deemed appropriate)
Determination to use tetramers
Virtual coding strands for each of the 195 selected genomes were built and analyzed for motif frequencies of lengths two to five (dimers, trimers, tetramers, and pentamers) using a suite of Perl scripts developed in-house. 
	For each motif size of length L={2, 3, 4, 5}, classification models were built using all 4^L motifs within each genome. For example, for tetramers (L=4), each genome has 4^4 = 256 motif frequencies. The number of misclassified organisms was recorded for each experimental classification (hyper vs. non-hyper, meso vs. non-meso) when using all the organisms to build a model (Table A1), as well as in the "leave one out" analyses.
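The motif counts described above can be sketched in Python (the in-house Perl scripts are not reproduced here). The function names are hypothetical, and the use of overlapping windows with a step of one base is an assumption of this sketch.

```python
from collections import Counter
from itertools import product


def all_motifs(length):
    """Enumerate all 4^L motifs of the given length over the alphabet ACGT."""
    return ["".join(letters) for letters in product("ACGT", repeat=length)]


def motif_frequencies(sequence, length):
    """Relative frequency of every length-L motif, using overlapping windows.

    Windows containing characters other than A, C, G, or T are skipped, so
    the frequencies of the 4^L motifs sum to 1 whenever any window is valid.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
    motifs = all_motifs(length)
    total = sum(counts[m] for m in motifs)
    return {m: (counts[m] / total if total else 0.0) for m in motifs}


# Number of candidate classifiers for each motif length considered.
for L in (2, 3, 4, 5):
    print(L, len(all_motifs(L)))

# Hypothetical toy sequence (not from any genome in this study).
toy_freqs = motif_frequencies("GAGAGAGA", 4)
print(toy_freqs["GAGA"], toy_freqs["AGAG"])
```

Each genome is thus reduced to a fixed-length vector (256 values for tetramers), which is the form of input a classification tree needs regardless of genome size.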


Table A1. Classification results using all 4^L oligonucleotide frequencies over the entire virtual coding strand. Tetramers provide the most effective classification scheme considering both models built with all organisms and predictions of individual organisms against models built with the other 194 organisms.
Model with All: Number of misclassifications when building a model using all 195 genomes
Leave One Out Prediction: Number of missed predictions when training on 194 genomes and attempting to classify the one left out

Motif length  Model with All                          Leave One Out Prediction
(L-mer)       Hyper vs. Non-Hyper  Meso vs. Non-Meso  Hyper vs. Non-Hyper  Meso vs. Non-Meso
2-mer         5 (2.6%)             8 (4.1%)           10 (5.1%)            24 (12.3%)
3-mer         4 (2.1%)             6 (3.1%)           12 (6.2%)            14 (7.2%)
4-mer         2 (1.0%)             9 (4.6%)           8 (4.1%)             14 (7.2%)
5-mer         2 (1.0%)             9 (4.6%)           13 (6.7%)            23 (11.8%)




Table A2. A list of the 195 organisms used in this study (24 Archaea and 171 Bacteria), completely sequenced as of April 1st, 2006 and downloaded from GenBank (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi).


No.  Temperature Range  Organism (24 Archaea)
1    Hyperthermophilic  Aeropyrum pernix
2    Hyperthermophilic  Archaeoglobus fulgidus
3    Hyperthermophilic  Methanococcus jannaschii
4    Hyperthermophilic  Methanopyrus kandleri
5    Hyperthermophilic  Nanoarchaeum equitans
6    Hyperthermophilic  Pyrobaculum aerophilum
7    Hyperthermophilic  Pyrococcus abyssi
8    Hyperthermophilic  Pyrococcus furiosus
9    Hyperthermophilic  Pyrococcus horikoshii
10   Hyperthermophilic  Sulfolobus solfataricus
11   Hyperthermophilic  Sulfolobus tokodaii
12   Hyperthermophilic  Thermococcus kodakaraensis KOD1
13   Mesophilic         Haloarcula marismortui ATCC 43049
14   Mesophilic         Halobacterium sp
15   Mesophilic         Methanococcus maripaludis S2
16   Mesophilic         Methanosarcina acetivorans
17   Mesophilic         Methanosarcina barkeri fusaro
18   Mesophilic         Methanosarcina mazei
19   Mesophilic         Methanosphaera stadtmanae
20   Thermophilic       Methanobacterium thermoautotrophicum
21   Thermophilic       Picrophilus torridus DSM 9790
22   Thermophilic       Sulfolobus acidocaldarius DSM 639
23   Thermophilic       Thermoplasma acidophilum
24   Thermophilic       Thermoplasma volcanium


No.  Temperature Range  Organism (171 Bacteria)
1    Hyperthermophilic  Aquifex aeolicus
2    Hyperthermophilic  Carboxydothermus hydrogenoformans Z-2901
3    Hyperthermophilic  Thermoanaerobacter tengcongensis
4    Hyperthermophilic  Thermotoga maritima
5    Mesophilic         Acinetobacter sp ADP1
6    Mesophilic         Agrobacterium tumefaciens C58 Cereon
7    Mesophilic         Azoarcus sp EbN1
8    Mesophilic         Bacillus anthracis Ames
9    Mesophilic         Bacillus cereus ATCC 10987
10   Mesophilic         Bacillus halodurans
11   Mesophilic         Bacillus licheniformis ATCC 14580
12   Mesophilic         Bacillus subtilis
13   Mesophilic         Bacillus thuringiensis konkukian
14   Mesophilic         Bacteroides fragilis NCTC 9434
15   Mesophilic         Bacteroides thetaiotaomicron VPI-5482
16   Mesophilic         Bartonella henselae Houston-1
17   Mesophilic         Bartonella quintana Toulouse
18   Mesophilic         Bdellovibrio bacteriovorus
19   Mesophilic         Bifidobacterium longum
20   Mesophilic         Bordetella bronchiseptica
21   Mesophilic         Bordetella parapertussis
22   Mesophilic         Bordetella pertussis
23   Mesophilic         Borrelia burgdorferi
24   Mesophilic         Borrelia garinii PBi
25   Mesophilic         Bradyrhizobium japonicum
26   Mesophilic         Brucella abortus 9-941
27   Mesophilic         Brucella melitensis
28   Mesophilic         Brucella suis 1330
29   Mesophilic         Buchnera aphidicola
30   Mesophilic         Burkholderia mallei ATCC 23344
31   Mesophilic         Burkholderia pseudomallei 1710b
32   Mesophilic         Burkholderia thailandensis E264
33   Mesophilic         Campylobacter jejuni
34   Mesophilic         Caulobacter crescentus
35   Mesophilic         Chlamydia muridarum
36   Mesophilic         Chlamydia trachomatis
37   Mesophilic         Chlamydophila abortus S26 3
38   Mesophilic         Chlamydophila caviae
39   Mesophilic         Chlamydophila pneumoniae AR39
40   Mesophilic         Chromobacterium violaceum
41   Mesophilic         Clostridium acetobutylicum
42   Mesophilic         Clostridium perfringens
43   Mesophilic         Clostridium tetani E88
44   Mesophilic         Corynebacterium diphtheriae
45   Mesophilic         Corynebacterium efficiens YS-314
46   Mesophilic         Corynebacterium glutamicum ATCC 13032 Bielefeld
47   Mesophilic         Corynebacterium jeikeium K411
48   Mesophilic         Coxiella burnetii
49   Mesophilic         Dehalococcoides CBDB1
50   Mesophilic         Dehalococcoides ethenogenes 195
51   Mesophilic         Deinococcus radiodurans
52   Mesophilic         Desulfovibrio desulfuricans G20
53   Mesophilic         Desulfovibrio vulgaris Hildenborough
54   Mesophilic         Ehrlichia ruminantium Gardel
55   Mesophilic         Enterococcus faecalis V583
56   Mesophilic         Erwinia carotovora atroseptica SCRI1043
57   Mesophilic         Escherichia coli K12
58   Mesophilic         Frankia CcI3
59   Mesophilic         Fusobacterium nucleatum
60   Mesophilic         Geobacter metallireducens GS-15
61   Mesophilic         Geobacter sulfurreducens
62   Mesophilic         Gloeobacter violaceus
63   Mesophilic         Gluconobacter oxydans 621H
64   Mesophilic         Haemophilus ducreyi 35000HP
65   Mesophilic         Haemophilus influenzae
66   Mesophilic         Hahella chejuensis KCTC 2396
67   Mesophilic         Helicobacter hepaticus
68   Mesophilic         Helicobacter pylori 26695
69   Mesophilic         Idiomarina loihiensis L2TR
70   Mesophilic         Lactobacillus acidophilus NCFM
71   Mesophilic         Lactobacillus johnsonii NCC 533
72   Mesophilic         Lactobacillus plantarum
73   Mesophilic         Lactobacillus sakei 23K
74   Mesophilic         Lactococcus lactis
75   Mesophilic         Legionella pneumophila Lens
76   Mesophilic         Leifsonia xyli xyli CTCB0
77   Mesophilic         Leptospira interrogans serovar Copenhageni
78   Mesophilic         Listeria innocua
79   Mesophilic         Listeria monocytogenes
80   Mesophilic         Magnetospirillum magneticum AMB-1
81   Mesophilic         Mannheimia succiniciproducens MBEL55E
82   Mesophilic         Mesoplasma florum L1
83   Mesophilic         Mesorhizobium loti
84   Mesophilic         Mycobacterium avium paratuberculosis
85   Mesophilic         Mycobacterium bovis
86   Mesophilic         Mycobacterium leprae
87   Mesophilic         Mycobacterium tuberculosis CDC1551
88   Mesophilic         Mycoplasma capricolum ATCC 27343
89   Mesophilic         Mycoplasma gallisepticum
90   Mesophilic         Mycoplasma genitalium
91   Mesophilic         Mycoplasma hyopneumoniae 232
92   Mesophilic         Mycoplasma mobile 163K
93   Mesophilic         Mycoplasma mycoides
94   Mesophilic         Mycoplasma penetrans
95   Mesophilic         Mycoplasma pneumoniae
96   Mesophilic         Mycoplasma pulmonis
97   Mesophilic         Mycoplasma synoviae 53
98   Mesophilic         Neisseria gonorrhoeae FA 1090
99   Mesophilic         Neisseria meningitidis MC58
100  Mesophilic         Nitrobacter winogradskyi Nb-255
101  Mesophilic         Nitrosomonas europaea
102  Mesophilic         Nitrosospira multiformis ATCC 25196
103  Mesophilic         Nocardia farcinica IFM10152
104  Mesophilic         Oceanobacillus iheyensis
105  Mesophilic         Pasteurella multocida
106  Mesophilic         Pelobacter carbinolicus
107  Mesophilic         Pelodictyon luteolum DSM 273
108  Mesophilic         Photorhabdus luminescens
109  Mesophilic         Porphyromonas gingivalis W83
110  Mesophilic         Prochlorococcus marinus CCMP1375
111  Mesophilic         Propionibacterium acnes KPA171202
112  Mesophilic         Pseudomonas aeruginosa
113  Mesophilic         Pseudomonas fluorescens Pf-5
114  Mesophilic         Pseudomonas putida KT2440
115  Mesophilic         Pseudomonas syringae
116  Mesophilic         Ralstonia eutropha JMP134
117  Mesophilic         Ralstonia solanacearum
118  Mesophilic         Rhodobacter sphaeroides 2 4 1
119  Mesophilic         Rhodopseudomonas palustris CGA009
120  Mesophilic         Rhodospirillum rubrum ATCC 11170
121  Mesophilic         Rickettsia conorii
122  Mesophilic         Rickettsia prowazekii
123  Mesophilic         Rickettsia typhi wilmington
124  Mesophilic         Salinibacter ruber DSM 13855
125  Mesophilic         Salmonella enterica Choleraesuis
126  Mesophilic         Salmonella typhi
127  Mesophilic         Salmonella typhimurium LT2
128  Mesophilic         Shewanella oneidensis
129  Mesophilic         Shigella boydii Sb227
130  Mesophilic         Shigella dysenteriae
131  Mesophilic         Shigella flexneri 2a
132  Mesophilic         Shigella sonnei Ss046
133  Mesophilic         Sinorhizobium meliloti
134  Mesophilic         Sodalis glossinidius morsitans
135  Mesophilic         Staphylococcus aureus aureus MRSA252
136  Mesophilic         Staphylococcus epidermidis ATCC 12228
137  Mesophilic         Staphylococcus haemolyticus
138  Mesophilic         Staphylococcus saprophyticus
139  Mesophilic         Streptococcus agalactiae 2603
140  Mesophilic         Streptococcus pneumoniae R6
141
Mesophilic
Streptococcus pyogenes M1 GAS
142
Mesophilic
Streptomyces coelicolor
143
Mesophilic
Synechococcus elongatus PCC 6301
144
Mesophilic
Thiomicrospira crunogena XCL-2
145
Mesophilic
Thiomicrospira denitrificans ATCC 33889
146
Mesophilic
Treponema denticola ATCC 35405
147
Mesophilic
Treponema pallidum
148
Mesophilic
Tropheryma whipplei TW08 27
149
Mesophilic
Ureaplasma urealyticum
150
Mesophilic
Vibrio cholerae
151
Mesophilic
Vibrio fischeri ES114
152
Mesophilic
Vibrio parahaemolyticus
153
Mesophilic
Vibrio vulnificus CMCP6
154
Mesophilic
Wigglesworthia brevipalpis
155
Mesophilic
Wolinella succinogenes
156
Mesophilic
Xanthomonas campestris
157
Mesophilic
Xanthomonas citri
158
Mesophilic
Xanthomonas oryzae KACC10331
159
Mesophilic
Xylella fastidiosa
160
Mesophilic
Yersinia pestis biovar Mediaevails
161
Mesophilic
Yersinia pseudotuberculosis IP32953
162
Mesophilic
Zymomonas mobilis ZM4
163
Thermophilic
Chlorobium tepidum TLS
164
Thermophilic
Geobacillus kaustophilus HTA426
165
Thermophilic
Methylococcus capsulatus Bath
166
Thermophilic
Moorella thermoacetica ATCC 39073
167
Thermophilic
Streptococcus thermophilus CNRZ1066
168
Thermophilic
Symbiobacterium thermophilum IAM14863
169
Thermophilic
Thermobifida fusca YX
170
Thermophilic
Thermosynechococcus elongatus
171
Thermophilic
Thermus thermophilus HB27




		 

